CzeSL – an error tagged corpus of Czech as a second language
Abstract
Taking an error-annotated learner corpus as its basis, this paper has two goals: (i) to evaluate the practicality of the annotation scheme by computing inter-annotator agreement on a non-trivial sample of data, and (ii) to determine whether applying automated linguistic annotation tools (taggers, spell checkers and grammar checkers) to learner text is a viable substitute for manual annotation.
Similar resources
Corpus based coreference resolution for Farsi text
"Coreference resolution", or finding all expressions in a text that refer to the same entity, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
Error-Tagged Learner Corpus of Czech
The paper describes a learner corpus of Czech, currently under development. The corpus captures Czech as used by nonnative speakers. We discuss its structure, the layered annotation of errors and the annotation process.
PAYMA: A Tagged Corpus of Persian Named Entities
The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...
Improvements to Korektor: A Case Study with Native and Non-Native Czech
We present recent developments of Korektor, a statistical spell checking system. In addition to a lexicon, Korektor uses language models to find real-word errors, which are detectable only in context. The models and error probabilities, learned from error corpora, are also used to suggest the most likely corrections. Korektor was originally trained on a small error corpus and used language models extracted...
Corpus-Based Rules for Czech Verb Discontinuous Constituents
In this paper we present a method for extracting general structures of verb groups from a tagged and fully disambiguated corpus and for subsequently using these structures to build a formal grammar in the Prolog DCG style. Our goal is to apply them as rules for the analysis of Czech verb groups in non-disambiguated, grammatically tagged Czech corpus texts. The problem...
Publication date: 2011